In this series of worksheets I'll show how you can use Javascript with node.js to acquire data from files and web services, or by web scraping, and how you can manipulate that data into the right shape for further analysis and visualisation.
Data applications come in all shapes and sizes. A common challenge faced by most of them is how to grab some data and get it into the right shape. The approach outlined here is to create a "data pipeline" that pulls the data in, cleans and reshapes it, and then feeds it to your application:
Three techniques for acquiring data are shown, and a minimal sketch of all three follows. For data in files, we can simply load the file. For data embedded in web pages, we can use a technique called web scraping. For data offered by a web service, we can call the web API that the service provides:
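To make these concrete, here is a minimal sketch of all three techniques in node.js. It assumes Node.js 18 or later (for the built-in `fetch`) and the third-party `cheerio` package for scraping; the file name, URLs and CSS selector are purely illustrative.

```javascript
const fs = require('fs');
const cheerio = require('cheerio'); // one popular scraping library: npm install cheerio

// 1. Data in a file: just load and parse it.
const fileData = JSON.parse(fs.readFileSync('data.json', 'utf8'));

// 2. Data embedded in a web page: fetch the HTML, then scrape it.
async function scrapeHeadings(url) {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  return $('h2').map((i, el) => $(el).text()).get(); // text of every <h2>
}

// 3. Data from a web service: call its web API and parse the JSON response.
async function callApi(url) {
  const response = await fetch(url);
  return response.json();
}
```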
The cleaning and shaping of the data can be a sizeable task in its own right. Typical challenges include missing values, inconsistent formats and data that isn't structured the way your application needs it, and a range of techniques can be used to address them.
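As a small taste of what that involves, the sketch below (using made-up records) shows a few common steps: dropping incomplete records, converting strings to numbers and trimming stray whitespace.

```javascript
// Made-up raw records, as they might arrive from a file or a scrape.
const raw = [
  { name: 'Alice', age: '34', city: ' London ' },
  { name: 'Bob',   age: '',   city: 'Paris'    },
];

const cleaned = raw
  .filter(r => r.age !== '')   // drop records with a missing age
  .map(r => ({
    name: r.name,
    age: Number(r.age),        // convert the age string to a number
    city: r.city.trim(),       // strip stray whitespace
  }));

console.log(cleaned); // [ { name: 'Alice', age: 34, city: 'London' } ]
```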
There are numerous tools that can be used to implement these techniques. Python has many useful libraries, such as Beautiful Soup and Selenium for web scraping, requests for calling web APIs and Pandas for loading files and manipulating the data. Spreadsheets like Excel also provide a lot of data handling facilities, and they are tempting to use because the learning curve is relatively low and most people already have some experience with them. However, they may not always provide a robust and repeatable way to implement your data pipeline.
These worksheets focus on using Javascript and node.js. Historically, Javascript has been used in the browser to provide dynamic behaviour for web applications; Node.js is an environment for running Javascript outside of the browser. One big advantage of this is that if the destination for the data is a browser-based Javascript application (e.g. a P5 application or a D3 data visualisation), the data preparation can be coded in the same language. So you don't, for example, have to learn Javascript for the browser-based application and Python for the data preparation.
So, the architecture we are working towards looks something like this: